Lealia Xiong

Data Storyteller | Adaptable Collaborator | Synthetic Biologist | Rapid Prototyper


Predicting median income from health data¶

In [1]:
import numpy as np
import pandas as pd

import sklearn.preprocessing
import sklearn.linear_model

import holoviews as hv
import colorcet as cc

from health import *

hv.extension('bokeh')
In [2]:
blue = cc.b_glasbey_hv[0]
red = cc.b_glasbey_hv[1]

Load, clean, and join data - California¶

Drop 'Limited access to healthy foods' (see exploratory_data_analysis; 15% of that column's values are missing).
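The 15% figure comes from a per-column missing-value check; a minimal sketch of that check on a toy frame (all values made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the health data (hypothetical values).
toy = pd.DataFrame({
    'Limited access to healthy foods': [1.0, np.nan, 2.0, np.nan, 1.5],
    'Obesity': [25.0, 27.1, 31.2, 28.4, 26.0],
})

# Fraction of missing values per column; heavily missing columns
# are candidates for dropping before modeling.
missing_frac = toy.isna().mean()
```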

In [3]:
health_file = 'California - v14.1 - released May 17th, 2022/CHDB_data_tract_CA_v14.1.txt'
health_data = load_health_data(health_file).drop(['Limited access to healthy foods'], axis=1)

income_data_file = 'california_median_household_income_acs/ACSDT5Y2018.B19013_data_with_overlays_2022-06-23T200015.csv'
income_data = load_hinc_data(income_data_file)

df = pd.merge(health_data, income_data[['median income', 'stcotr_fips']])

Missing values¶

First, let's try just dropping missing values.

In [4]:
n = len(df)

print(f'Original number of data points: {n}.')

df.dropna(inplace=True)

print(f'Number of rows dropped: {n - len(df)}, {100 * (n - len(df)) / n:.1f}%.')
Original number of data points: 6075.
Number of rows dropped: 407, 6.7%.
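Dropping 6.7% of the rows is acceptable here; an alternative (not used in this notebook) is median imputation, sketched below on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the joined data (hypothetical values).
toy = pd.DataFrame({
    'Obesity': [25.0, np.nan, 31.2, 28.4],
    'Smoking': [12.1, 15.3, np.nan, 9.8],
})

# Median imputation keeps every row instead of dropping the ones with gaps.
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```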

Correlations¶

In [5]:
df['log10(median income)'] = np.log10(np.array(df['median income']))
In [6]:
column_renaming = {column_name: column_name.replace(' ', '\n') for column_name in df.columns}
In [7]:
ds = hv.Dataset(
    data=df.rename(columns=column_renaming),
    kdims=['stcotr_fips'],
)
In [8]:
(
    hv.operation.gridmatrix(ds, chart_type=hv.Points)[:, 'median\nincome'] +
    hv.operation.gridmatrix(ds, chart_type=hv.Points)[:, 'log10(median\nincome)']
).cols(
    1
)
Out[8]:

Model building¶

We will use ridge regression with built-in (leave-one-out) cross-validation, provided by sklearn's RidgeCV.
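RidgeCV scans a grid of regularization strengths and, with the default cv=None, scores each one with an efficient leave-one-out procedure, keeping the best. A minimal sketch on synthetic data (all values made up):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# RidgeCV evaluates every alpha on the grid via leave-one-out
# cross-validation and stores the winner in .alpha_.
alphas = np.logspace(-6, 6, 13)
demo_model = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
```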

In [9]:
X = df.drop(
    ['median income', 'log10(median income)', 'stcotr_fips'], axis=1
)
y = np.log10(
    np.array(df['median income']).reshape(-1, 1)
)

Preprocessing¶

In [10]:
transformer = sklearn.preprocessing.RobustScaler()
transformer.fit(X)

X_scaled = transformer.transform(X)

Model training¶

In [11]:
model = sklearn.linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))

model.fit(X_scaled, y)
Out[11]:
RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))
In [12]:
model.alpha_
Out[12]:
1.0

Performance on training data¶

In [13]:
train_score = model.score(X_scaled, y)
train_score
Out[13]:
0.9490166406907403
In [14]:
y_pred = model.predict(X_scaled)
In [15]:
plot = plot_prediction_vs_actual(y_pred, y, 'median income', log10=True)
plot.opts(
    hv.opts.AdjointLayout(title=f'score: {train_score:.2f}'),
    hv.opts.Points(alpha=0.4)
)
Out[15]:

Feature weights¶

In [16]:
metrics = X.columns
metrics_sorted = np.array(metrics)[np.argsort(model.coef_)]
metrics_sorted = metrics_sorted.tolist()[0]
In [17]:
weights = model.coef_.copy()
weights.sort()
weights = weights.tolist()[0]
In [18]:
df_weights = pd.DataFrame({'metrics': metrics_sorted, 'weights': weights})
In [19]:
feature_weights = hv.Bars(df_weights, kdims='metrics')
feature_weights.opts(
    frame_width=800,
    frame_height=400,
    xrotation=45,
)
Out[19]:
In [20]:
metrics = X.columns
metrics_abs_sorted = np.array(metrics)[np.argsort(np.abs(model.coef_))]
metrics_abs_sorted = metrics_abs_sorted.tolist()[0]
In [21]:
abs_weights = model.coef_.copy()
abs_weights = np.abs(abs_weights)
abs_weights.sort()
abs_weights = abs_weights.tolist()[0]
In [22]:
df_abs_weights = pd.DataFrame({'metrics': metrics_abs_sorted, 'weights (absolute value)': abs_weights})
In [23]:
feature_abs_weights = hv.Bars(df_abs_weights, kdims='metrics')
feature_abs_weights.opts(
    frame_width=800,
    frame_height=400,
    xrotation=45,
)
Out[23]:

Although we achieved a high score on the training data, 'Income Inequality' is by far the dominant feature. Since income inequality is itself derived from household incomes, leaning on it amounts to a form of target leakage.
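One quick diagnostic is to check how much of the total absolute weight the single largest feature carries; a sketch with hypothetical coefficients standing in for model.coef_[0]:

```python
import numpy as np

# Hypothetical coefficient vector (made-up values, not the fitted model's).
coefs = np.array([0.02, -0.05, 0.31, 0.04, -0.01])

# Share of total absolute weight carried by the single largest feature.
# A lopsided share suggests that feature is a near-proxy for the target.
top_share = np.max(np.abs(coefs)) / np.sum(np.abs(coefs))
```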

Model building after removing 'Income inequality'¶

In [24]:
X = df.drop(
    ['median income', 'log10(median income)', 'Income Inequality', 'stcotr_fips'], axis=1
)
y = np.log10(
    np.array(df['median income']).reshape(-1, 1)
)

Preprocessing¶

In [25]:
transformer = sklearn.preprocessing.RobustScaler()
transformer.fit(X)

X_scaled = transformer.transform(X)

Model training¶

In [26]:
model = sklearn.linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))

model.fit(X_scaled, y)
/Users/lealia/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py:1791: RuntimeWarning: invalid value encountered in reciprocal
  w = ((singvals_sq + alpha) ** -1) - (alpha ** -1)
Out[26]:
RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))
In [27]:
model.alpha_
Out[27]:
1.0

Performance on training data¶

In [28]:
train_score = model.score(X_scaled, y)
train_score
Out[28]:
0.8858283254741887
In [29]:
y_pred = model.predict(X_scaled)
In [30]:
plot = plot_prediction_vs_actual(y_pred, y, 'median income', log10=True)
plot.opts(
    hv.opts.AdjointLayout(title=f'score: {train_score:.2f}'),
    hv.opts.Points(alpha=0.4)
)
Out[30]:

We consistently underpredict high household incomes. This may be because the highest incomes are top-coded as '250,000+' in the source data, and I replaced that label with 250000. There are other ways to address this that I haven't implemented.
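One such alternative (sketched here, not part of the notebook) is to keep an explicit censoring flag when parsing the raw labels, so censored tracts can be dropped or modeled separately rather than silently pinned at 250000:

```python
import pandas as pd

# Toy income column mimicking the ACS top-coding (hypothetical values).
raw = pd.Series(['85000', '250,000+', '120000', '250,000+'])

# Flag the censored entries, then strip the formatting to get numbers.
censored = raw.str.contains('+', regex=False)
income = pd.to_numeric(raw.str.replace(',', '', regex=False).str.rstrip('+'))

# The flag lets us, e.g., exclude censored tracts from fitting.
uncensored = income[~censored]
```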

Feature weights¶

In [31]:
metrics = X.columns
metrics_sorted = np.array(metrics)[np.argsort(model.coef_)]
metrics_sorted = metrics_sorted.tolist()[0]
In [32]:
weights = model.coef_.copy()
weights.sort()
weights = weights.tolist()[0]
In [33]:
df_weights = pd.DataFrame({'metrics': metrics_sorted, 'weights': weights})
In [34]:
feature_weights = hv.Bars(df_weights, kdims='metrics')
feature_weights.opts(
    frame_width=800,
    frame_height=400,
    xrotation=45,
)
Out[34]:
In [35]:
metrics = X.columns
metrics_abs_sorted = np.array(metrics)[np.argsort(np.abs(model.coef_))]
metrics_abs_sorted = metrics_abs_sorted.tolist()[0]
In [36]:
abs_weights = model.coef_.copy()
abs_weights = np.abs(abs_weights)
abs_weights.sort()
abs_weights = abs_weights.tolist()[0]
In [37]:
df_abs_weights = pd.DataFrame({'metrics': metrics_abs_sorted, 'weights (absolute value)': abs_weights})
In [38]:
feature_abs_weights = hv.Bars(df_abs_weights, kdims='metrics')
feature_abs_weights.opts(
    frame_width=800,
    frame_height=400,
    xrotation=45,
)
Out[38]:

These metrics are much more evenly weighted.

In [39]:
import json
In [44]:
with open('metrics_sorted_by_weight.json', 'w') as f:
    # Dump the list directly; passing it through json.dumps first
    # would write a doubly-encoded JSON string.
    json.dump(metrics_abs_sorted[::-1], f)

Income correlations with individual metrics¶

In [39]:
pearson_correlations = np.zeros(len(metrics))
for i, metric in enumerate(metrics):
    corr = df['median income'].corr(df[metric])
    pearson_correlations[i] = corr
In [40]:
df_corr = pd.DataFrame({'metrics': metrics, 'correlation': pearson_correlations.tolist()})
In [41]:
corr_bars = hv.Bars(df_corr, kdims='metrics')
In [42]:
df_weights_unsorted = pd.DataFrame({'metrics': metrics, 'weights': model.coef_.tolist()[0]})
In [43]:
feature_weights = hv.Bars(df_weights_unsorted, kdims='metrics')
In [44]:
(
    feature_weights.opts(
        frame_width=800,
        frame_height=200,
        xaxis=None,
    ) + 
    corr_bars.opts(
        frame_width=800,
        frame_height=200,
        xrotation=45,
        color='orange'
    )
).cols(
    1
)
Out[44]:

Apply to test data¶

Can we use what we learned in California to predict median income in Colorado?

In [65]:
test_health_file = 'Colorado - v14.1 - released May 17th, 2022/CHDB_data_tract_CO_v14.1.txt'
test_health_data = load_health_data(test_health_file).drop(['Limited access to healthy foods'], axis=1)

test_income_data_file = 'colorado_median_household_income_acs/ACSDT5Y2018.B19013_data_with_overlays_2022-06-23T225911.csv'
test_income_data = load_hinc_data(test_income_data_file)

test_df = pd.merge(test_health_data, test_income_data[['median income', 'stcotr_fips']])
In [66]:
n = len(test_df)

print(f'Original number of data points: {n}.')

test_df.dropna(inplace=True)

print(f'Number of rows dropped: {n - len(test_df)}, {100 * (n - len(test_df)) / n:.1f}%.')
Original number of data points: 804.
Number of rows dropped: 111, 13.8%.
In [47]:
X_test = test_df.drop(
    ['median income', 'Income Inequality', 'stcotr_fips'], axis=1
)
y_test_actual = np.log10(
    np.array(test_df['median income']).reshape(-1, 1)
)

Preprocessing¶

In [48]:
transformer = sklearn.preprocessing.RobustScaler()
# Fit the scaler on the California training features so the
# Colorado features are scaled on the same basis.
transformer.fit(X)

X_test_scaled = transformer.transform(X_test)

Predict test data¶

In [49]:
y_test_pred = model.predict(X_test_scaled)
In [50]:
test_score = model.score(X_test_scaled, y_test_actual)
test_score
Out[50]:
0.49248220210869775
In [51]:
plot_test = plot_prediction_vs_actual(y_test_pred, y_test_actual, label='median income', log10=True)
plot_test.opts(
    hv.opts.Layout(title=f'score: {test_score:.2f}'),
    hv.opts.Points(color=red, alpha=0.4),
    hv.opts.Distribution(color=red)
)
Out[51]:

The model scored quite poorly on the Colorado data, but it looks like there might be some systematic error...

In [52]:
(
    hv.Distribution(y, kdims=['log10(income)'], label='California') * hv.Distribution(y_test_actual, kdims=['log10(income)'], label='Colorado')
).opts(
    legend_position='right',
    frame_width=400,
    frame_height=300,
)
Out[52]:

Indeed, the whole distribution of Colorado incomes is shifted to the left of the California distribution, which may explain why our model consistently predicts incomes higher than the true values.

In [53]:
colorado_state_median = np.median(y_test_actual)
colorado_state_median
Out[53]:
4.825185858559518
In [54]:
california_state_median = np.median(y)
california_state_median
Out[54]:
4.850499199426245
In [55]:
delta = colorado_state_median - california_state_median

Correct Colorado data by difference in state median household income from California state median¶

In [56]:
y_test_adj = y_test_actual - delta
In [57]:
(
    hv.Distribution(y, kdims=['log10(income)'], label='California') * hv.Distribution(y_test_adj, kdims=['log10(income)'], label='Colorado (adjusted to California median)')
).opts(
    legend_position='right',
    frame_width=400,
    frame_height=300,
)
Out[57]:
In [58]:
test_score = model.score(X_test_scaled, y_test_adj)
test_score
Out[58]:
0.642861281963671
In [59]:
plot_test_adj = plot_prediction_vs_actual(y_test_pred, y_test_adj, label='median income', log10=True)
plot_test_adj.opts(
    hv.opts.Layout(title=f'score: {test_score:.2f}'),
    hv.opts.Points(color=red, alpha=0.4),
    hv.opts.Distribution(color=red)
)
Out[59]:

We can still observe some systematic bias: most predictions are still too high. However, the model score does improve after applying the systematic correction.
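The score gain from the median correction can be reproduced on synthetic data: a constant prediction bias depresses R², and aligning the medians recovers most of it. A sketch (all values made up):

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic targets whose predictions are biased upward by a constant.
rng = np.random.default_rng(2)
y_true = rng.normal(size=300)
y_pred = y_true + 0.3 + rng.normal(scale=0.1, size=300)

# Aligning the medians, as done above for Colorado, removes the
# constant bias and raises the score.
offset = np.median(y_pred) - np.median(y_true)
raw_score = r2_score(y_true, y_pred)
adjusted_score = r2_score(y_true + offset, y_pred)
```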

In [60]:
test_df['log10(median income)'] = np.log10(np.array(test_df['median income']))
In [61]:
column_renaming = {column_name: column_name.replace(' ', '\n') for column_name in test_df.columns}
In [62]:
test_ds = hv.Dataset(
    data=test_df.rename(columns=column_renaming),
    kdims=['stcotr_fips'],
)
In [63]:
(
    hv.operation.gridmatrix(test_ds, chart_type=hv.Points)[:, 'median\nincome'] +
    hv.operation.gridmatrix(test_ds, chart_type=hv.Points)[:, 'log10(median\nincome)']
).opts(
    hv.opts.Points(color=red),
    hv.opts.Histogram(color=red),
).cols(
    1
)
Out[63]: